home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
ftp.cs.arizona.edu
/
ftp.cs.arizona.edu.tar
/
ftp.cs.arizona.edu
/
icon
/
newsgrp
/
group99a.txt
/
000123_icon-group-sender _Fri Jun 4 08:03:30 1999.msg
< prev
next >
Wrap
Internet Message Format
|
2000-09-20
|
3KB
Return-Path: <icon-group-sender>
Received: (from root@localhost)
by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) id IAA19581
for icon-group-addresses; Fri, 4 Jun 1999 08:03:22 -0700 (MST)
Message-Id: <199906041503.IAA19581@baskerville.CS.Arizona.EDU>
Delivered-To: icon-group@silliac.cs.arizona.edu
Date: Fri, 4 Jun 99 00:51:50 -0500 (CDT)
From: gep2@terabites.com
Subject: Find the longest matching substrings
To: icon-group@optima.CS.Arizona.EDU
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
> I am working on a project that will mark the inserted or deleted phases
between two strings. I am wondering whether you know any library
procedure that will anchor ( or report ) the longest phases that exists
in both strings.
I'm not aware of any such ready-made routine, although certainly "longest common
substring" logic is frequently (and heavily!) used in file compression software
(PKZIP for example).
> For example, assume that we have two strings:
old:="other strings ab cd e and more substrings"
new:="some more substrings ab cd ab cd e"
> This function should return phases and position in both strings, i.e.
[15, 28, "ab cd e"] for "ab cd e" is the longest phase that matched (
it has three continuous matching elements ).
Intriguing that you consider the "three matching elements" phrase (only seven
characters) 'longer' than the sixteen-character common phrase " more
substrings". I guess the criteria here could be most anything you want to make
it, but still... :-)
Generally the logic to do stuff like this (in good programs, anyhow) is coded at
very low levels using very heavily optimized code... since it tends to be very,
very compute-intensive (this is especially true as the two strings get longer).
Of course, the problem's solution might be VERY different depending on whether
you're looking for something that's "elegant" in terms of the source code,
versus something that's FAST, versus something that is storage-efficient, or
something else. (The rule about what substring is 'longer' also is going to
have an effect too!).
If I were going to try to make something like this FAST (and especially if one
of the strings were relatively constant) I'd probably want to use a hash (or
even direct index?) table pre-computed with the offsets of character substrings
(I'd probably use two-character adjacent pairs) in one of those strings; this
would enable stepping rapidly through the second string and quickly finding at
least (only just the) plausible candidates for launching more detailed substring
compare (and length computation) operations. This kind of thing would likely be
storage-hungry, but that's less of a concern today than it once was.
Gordon Peterson
http://web2.airmail.net/gep2/
Support the Anti-SPAM Amendment! Join at http://www.cauce.org/
12/19/98: the day the Conservatives demonstrated their scorn for their
fraudulent sham of representative government. Voters, remember it!